Weakly supervised object detection (WSOD), the problem of learning detectors using only image-level labels, has been attracting increasing interest. However, the problem is challenging because no location supervision is available. To address this issue, this paper integrates saliency into a deep architecture in which location information is explored both explicitly and implicitly. Specifically, we select highly confident object proposals under the guidance of class-specific saliency maps. The location information of the selected proposals, together with their semantic and saliency information, is then used to explicitly supervise the network by imposing two additional losses. Meanwhile, a saliency prediction sub-network is built into the architecture, and its predictions are used to implicitly guide the localization procedure. The entire network is trained end-to-end. Experiments on PASCAL VOC demonstrate that our approach outperforms existing state-of-the-art methods.
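As an illustration of the proposal-selection step mentioned above, the following is a minimal sketch, not the paper's implementation. The abstract does not state how proposals are scored against a class-specific saliency map, so the scoring rule here (mean saliency inside the box minus mean saliency outside it), the threshold value, and the names `box_saliency_score` and `select_confident_proposals` are all assumptions made for illustration.

```python
from typing import List, Tuple

# A box is (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
# Boxes are assumed to lie inside the saliency map.
Box = Tuple[int, int, int, int]


def box_saliency_score(saliency: List[List[float]], box: Box) -> float:
    """Hypothetical score: mean saliency inside the box minus mean outside it."""
    x0, y0, x1, y1 = box
    inside_sum = 0.0
    total_sum = 0.0
    for y, row in enumerate(saliency):
        for x, value in enumerate(row):
            total_sum += value
            if x0 <= x < x1 and y0 <= y < y1:
                inside_sum += value
    height, width = len(saliency), len(saliency[0])
    inside_area = (x1 - x0) * (y1 - y0)
    outside_area = height * width - inside_area
    inside_mean = inside_sum / inside_area if inside_area else 0.0
    outside_mean = (total_sum - inside_sum) / outside_area if outside_area else 0.0
    return inside_mean - outside_mean


def select_confident_proposals(
    saliency: List[List[float]],
    proposals: List[Box],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[Box]:
    """Keep proposals whose score reaches the threshold, best first."""
    scored = [(box_saliency_score(saliency, box), box) for box in proposals]
    return [box for score, box in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # Toy 4x4 class-specific saliency map with a salient 2x2 block in the centre.
    toy_map = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0],
        [0.0, 1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
    ]
    candidates = [(1, 1, 3, 3), (0, 0, 4, 4), (0, 0, 2, 2)]
    print(select_confident_proposals(toy_map, candidates))
```

In this toy setting, a tight box around the salient region scores higher than a box covering the whole image or one that only partially overlaps the region. In a pipeline like the one described, proposals selected this way would serve as pseudo ground truth for the additional losses.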